Unsupervised pattern identification in spatial gene expression atlas reveals mouse brain regions beyond established ontology

Significance A comprehensive understanding of the spatial organization and differentiation of the mammalian brain requires interpreting 3D structural and molecular information in biologically plausible ways. At the moment, reliable computational methods and workflows are still lacking for reproducible analysis of gene expression data that incorporates existing domain knowledge of the brain structure. In this work, we combined stability-driven nonnegative matrix factorization with spatial correlation analysis to analyze the 3D spatial gene expression of the entire adult mouse brain. Our approach connects data-driven methods with spatial domain knowledge, revealing a gene expression-defined anatomical ontology and interpretable region-specific genetic architecture captured by the marker genes and spatial coexpression networks.


SI-1. Moran's I
To quantify the spatial coherence of PPs, we used Moran's I statistic (1).It was originally used in geostatistics and has more recently been used in spatial gene expression literature (2).Moran's I ranges in value from -1 to 1.A value close to -1 indicates little spatial organization, similar to a chess board with black and white squares distributed across the board.A value close to 1 indicates a clear spatially distinct pattern, such as if all the black squares in a chess board were on one side and all white squares on the other.We calculated Moran's I as follows (2): Here, xi and xj represent the PP coefficient at voxel locations i and j, respectively. is the mean gene expression level of each PP.N is the total number of voxel locations, wij is the spatial adjacency relationship (based on the adjacency matrix, w) between voxels i and j.W is the sum of all entries in w, which represents the cumulative total adjacencies.We mask the dataset to only include the brain region.Then, for each voxel, we select up to 6 voxels for determining adjacency (up, down, left, right, forward, background, where available), following the "rook" definition of neighborhood.We assign wij=1 if voxel j is adjacent to , and wij=0 otherwise.Given the large size of the adjacency matrix (159,326 x 159,326), we downsampled the PPs by removing every other row in each of the three dimensions to improve computational efficiency.Given certain voxels had multiple PPs with small but non-zero coefficients, we assigned each voxel in the brain map to the PP with the highest coefficient for that voxel.This ensures that unique voxels are not represented by multiple PPs.

SI-2. 3D visualizations of PPs
The 3D gene visualizations were performed using Napari viewer, a multi-dimensional image viewer for Python (3).Key settings in Napari for PPs included: opacity=1, gamma=1, blending='additive', depiction='volume', and rendering='attenuated MIP'.MIP stands for maximum intensity projection, which enhances the 3D representation of objects.We moved the slide bar to 20% from the left side for 'attenuated MIP.'

SI-3. Algorithm 1: Spatial neighborhood query (pairwise in 3D)
Algorithm 1 Spatial neighborhood query (pairwise in 3D)       Table S1: Region-specific marker genes from ISH and scRNA-seq data.Comparison of the PP-level (from ISH data) (7) and cell type-specific (from scRNA-seq data) (8) marker genes in the same spatial region of the adult mouse brain.The cell types subclass labels are from the Common Cell Type Nomenclature (CCN) (9).

Figure S1 :
Figure S1: Stability analysis with Amari-type error function.Instability score of staNMF PPs and PCA PPs across 100 runs for each  value, from 8 to 30 for ABA dataset.The error bars are the standard deviation.This figure uses Amari-type error (4), while Fig. 1B uses the Hungarian matching method (5).Both approaches identify  = 11 for the minimum instability score (and thus most stability) for staNMF PPs.

Figure S2 :
Figure S2: Moran's I per PP from staNMF and PCA.The plot uses data from 20 bootstrap simulations for each PP, for a total of 220 simulations for staNMF PPs and 220 simulations for PCA PPs.The mean Moran's I was 0.58 ± 0.12 for staNMF and 0.47 ± 0.15 for PCA.The p-value between the two samples was <0.001.The PPs from staNMF show greater spatial coherence, or higher Moran's I (1), than those from PCA for all but one case (PP8).

Figure S3 :
Figure S3: Similarity of PCA PPs to the expert-annotated brain regions.A. 11 PPs generated by PCA, ordered based on highest coarse region correlation to the CCFv3 ontology (6) in 3D and projected on the coronal plane.B. Heat map of the correlation coefficient between PCA PPs and the most similar combination of CCF regions (with the highest correlation coefficient).

Figure S4 :
Figure S4: Metrics for region-level comparison between staNMF PPs and the CCF.The PPs from staNMF and CCF regions (6) are compared using the Dice similarity (top) and the Pearson correlation (bottom) visualized as the size of the filled circle.The PPs and CCF regions are arranged the same way as in Fig. 2 and Fig. S5.

Figure S5 :
Figure S5: Summary of staNMF PPs linked to the best-fit combination of Allen CCF regions.The 10 PPs from main text Fig. 3 mapped to their best-fit combinations of CCF regions (6).

Figure S6 :
Figure S6: Putative spatial gene co-expression network construction, continued.Extension of Fig. 6 of the main text, the sGCNs from PPs 8-11 and their associated brain regions from the CCFv3 (6) are shown.The node color presents the selectivity of the gene to the PP associated with the brain region.An edge is drawn between genes if the similarity score is among the top 5% of all similarity scores for that gene subset.The edge color is proportional to the Pearson correlation of the reconstructed gene expression images of the two co-expressed genes.